In dit LC gaan we een 16S rRNA dataset bekijken. Hiervoor gebruiken we Rstudio met een aantal dedicated packedges. Hierbij moeten we achterhalen welke bactieren er aanwezig waren in het sample, en hoeveel er van de bacterien aanwezig was. Elke indidividuele read vertegenwoordigd een observatie.
Een belangrijke stap bij het opwerken van dergelijk data is het bepalen welke sequenties biologisch relevant zijn, en welke technische ruis zijn (sequencing errors).
Voor elke student is een rstudio omgeving aangemaakt op eem amazon server. Op deze server zijn alle benodigde R packeges al geinstalleerd. Je kan de server hier vinden:
http://ec2-3-120-193-137.eu-central-1.compute.amazonaws.com/ name: name_lastname pass: oibmw
Je kan de rmarkdown file (rmd) waarmee deze pagina is gemaakt en bijbehorende data downloaden door met git de repository te clonen. In Rstudio kunnen jullie de doen via projects => Version control => Git => https://github.com/AMCMC/OiBMW_Microbiome. Je moet wel nog even een folder aanmaken in je eigen home folder.
Als je deze rmd file opent in Rstudio kun je de losse “chuncks” runnen.
Most of the tutorial steps are fixed and code can be run as is. However, a few options are left for the students to modify in order to explore the data, and test the effects of different choices on the algoritm performance. There are several questions along the tutorial.
Questions are indicated in bold. Plz adress these questions here or in a seperate word document
Plz highlight your answers using:
<p style=“color:red”> Your answer !</p>
Which will show as this in the final html as:
Your answer !
Alternatively you can generate a pdf export form a word document if preferred. Documents can be uploaded in the module on Canvas before April ? /check canvas.
Note: It is not required to read and understand all code, but rather get a grasp of how the data is being processed and how this might impact downstream results
Amplicon sequencing is the most commonly used approach to determine the microbiome composition of a particular environment. Specific primers targeting universal genes are used to amplify phylogenetic divergent marker sequences. The obtained amplified sequences (amplicons) can than be sequenced using various platforms. The main task when analyzing these datasets is to determine which are true biological sequences and which are technical noise.
Fig 1. Single step 16S rRNA gene amplicon Illumina library preparation
During this practical we will analyze some 16S rRNA amplicon sequences obtained from a mock sample. Amplicons were generated using universal primers targeting the V3V4 regions of the 16S rRNA gene and were sequenced on an Illumina MiSeq platform (2x251) using V3 chemistry. The sample used is a mock sample which was artificially created using DNA from in total 55 unique bacterial strains (Table 1 - MC3). Since we know the true composition of the sample we can use it benchmark our wetlab protocols and bioinformatic pipelines.
During this workshop will apply DADA2 to infer the taxonomic composition of the data and compare that to the expected composition. DADA2 is a software package specifically developed to infer the true amplicon seqeuence variants (ASV) of a particular sample. Part of this workshop was based on the tutorial from the dada2 page.
Other packages used: For visualization we use ggplot2. To merge and organize microbiome data we use phyloseq DECIPHER and phangorn are used to generate a phylogenetic tree.
All dependancies are preinstalled in the server, however if you wish to run it on your own machine, the following code will install everything required locally.
.cran_packages <- c("ggplot2", "reshape2", "stringr","phangorn","DT","ape","ggrepel","knitr","Hmisc")
.bioc_packages <- c("dada2", "phyloseq","ShortRead","DECIPHER","ggtree")
sapply(c(.cran_packages, .bioc_packages), require, character.only = TRUE, quietly = T)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:ape':
##
## zoom
## The following objects are masked from 'package:base':
##
## format.pval, units
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:phyloseq':
##
## distance
## The following object is masked from 'package:grDevices':
##
## windows
##
## Attaching package: 'Biostrings'
## The following objects are masked from 'package:Hmisc':
##
## mask, translate
## The following object is masked from 'package:ape':
##
## complement
## The following object is masked from 'package:base':
##
## strsplit
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
## The following object is masked from 'package:phyloseq':
##
## sampleNames
## The following object is masked from 'package:Hmisc':
##
## contents
##
## Attaching package: 'ShortRead'
## The following object is masked from 'package:Hmisc':
##
## zoom
## The following object is masked from 'package:ape':
##
## zoom
## ggtree v2.4.2 For help: https://yulab-smu.top/treedata-book/
##
## If you use ggtree in published research, please cite the most appropriate paper(s):
##
## 1. Guangchuang Yu. Using ggtree to visualize data on tree-like structures. Current Protocols in Bioinformatics, 2020, 69:e96. doi:10.1002/cpbi.96
## 2. Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution 2018, 35(12):3041-3043. doi:10.1093/molbev/msy194
## 3. Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
##
## Attaching package: 'ggtree'
## The following object is masked from 'package:Biostrings':
##
## collapse
## The following object is masked from 'package:IRanges':
##
## collapse
## The following object is masked from 'package:S4Vectors':
##
## expand
## The following object is masked from 'package:ape':
##
## rotate
## ggplot2 reshape2 stringr phangorn DT ape ggrepel knitr
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Hmisc dada2 phyloseq ShortRead DECIPHER ggtree
## TRUE TRUE TRUE TRUE TRUE TRUE
theme_set(theme_bw())
The initial part of the DADA2 workflow consists of preprocessing of the raw fastq files. Data quality is assessed, and several pipeline parameters can be tweaked for optimal analysis.
DADA2 paired-end read handling is somewhat quirky, the forward and reverse reads are processed independently and only at the end of the workflow are they merged into a single representative sequence.
First have a look at the size and quality of the data.
FQF <- "./Raw_data/L3_MOCK1.F.fastq.gz"
FQR <- "./Raw_data/L3_MOCK1.R.fastq.gz"
plotQualityProfile(c(FQF,FQR))
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
So we see the sample has 74333 paired end reads. Furthermore, the quality of the reverse read is poorer than that of the forward read.
A bit more intuitive way to look at the quality of the reads, is to look at the expected error rate. The following code reads the phred quality score from the fastq files and calculates the overall expected error for each of the reads. We will only take 1000 reads and plot the cumulative error rate along the read. The quality score is the log transformed likelihood that a particular call was erroneous.
# get the expected error rates and generate the data.frame
cumelative_error_F <- apply(10^(-as.matrix(PhredQuality(quality(readFastq(FQF)[1:1000])))/10),1,cumsum)
cumelative_error_R <- apply(10^(-as.matrix(PhredQuality(quality(readFastq(FQR)[1:1000])))/10),1,cumsum)
# convert the data into long format (required for ggplot2)
cumelative_error_F_long <- data.frame(reshape2::melt(cumelative_error_F), read="Forward")
cumelative_error_R_long <- data.frame(reshape2::melt(cumelative_error_R), read="Reverse")
df.long <- rbind(cumelative_error_F_long, cumelative_error_R_long)
#df.long$Var1 <- factor(ceiling(df.long$Var1/10)*10) # bin the cycles into ba
ggplot(df.long, aes(x=Var1, y=value)) +
labs(y="Cumulative expected error rate", x="Cycle number") +
facet_wrap(~read) +
geom_smooth(stat = 'summary', color = 'red', fill = 'red', alpha = 0.2,
fun.data = median_hilow, fun.args = list(conf.int = 0.9)) +
geom_smooth(stat = 'summary', color = 'blue', fill = 'blue', alpha = 0.2,
fun.data = median_hilow, fun.args = list(conf.int = 0.50)) +
#scale_y_log10() +
geom_hline(yintercept = 1.35)
ee <- 1.35
sum(c(rowMedians(cumelative_error_F)<ee),rowMedians(cumelative_error_R)<ee)
## [1] 454
sum(rowMedians(cumelative_error_F)<ee)
## [1] 238
sum(rowMedians(cumelative_error_R)<ee)
## [1] 216
This shows that very few sequences are actually without error and the error rate increases almost exponentially. The quality of the data tells us that we can expect, on average, 2 errors in the forward read and 3 in the reverse read. This causes an enormous amount of technical sequence variants. Initially this issue was resolved by clustering sequences within a arbitrary sequence distance. Usually this distance would be set at 97% similarity. With an amplicon of 430 base pairs, that would mean sequences with up to 12 positional differences would be taken together in a single representative cluster or Operational taxonomic unit (OTU). However the last 5 years we have been using algorithms called denoisers, which allow us to discriminate between biologically relevant sequences and noise.
If we assume that most errors are random, we can reason that these errors will results in unique, singleton sequences, while sequences that are replicated are more likely to be true variants. Dada2 requires a minimal of two observations of a sequence for it be considered to be a true variant. We use the derepFastq function from DADA2 to dereplicate the forward and reverse reads.
drr <- derepFastq(FQF) # derep forward
drfut <- derepFastq(FQR) # derep reverse
Lets check the statistics of the forward read. The drr holds the results of the dereplication of the forward read. Within the derep class object there is a map, which shows which sequence belongs together (drr$map)
drr # check object
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 59118 unique sequences
## Sequence lengths: min=251, median=251, max=251
## $quals: Quality matrix dimension: 59118 251
## Consensus quality scores: min=7, median=35, max=38
## $map: Map from reads to unique sequences: 53754 5696 9443 25284 17014 ...
length(drr$map) # number of unique sequences
## [1] 74333
table(table(drr$map)) # distribution of abundances of these sequences
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 56871 1200 387 184 121 58 41 25 31 19 16 12 8
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 14 9 11 5 3 3 5 4 9 2 1 2 2
## 27 28 29 30 31 32 33 34 35 36 37 38 39
## 2 1 1 2 3 1 1 3 1 2 2 3 1
## 40 41 42 47 48 51 56 57 61 68 70 71 73
## 2 2 2 1 3 1 1 1 1 1 1 1 2
## 74 76 79 81 83 89 91 92 98 101 103 105 107
## 1 1 1 1 1 1 1 1 1 1 1 1 1
## 109 113 116 120 150 152 156 169 209 219 268 305 370
## 1 2 1 1 2 1 1 1 1 1 1 1 1
## 387 459 541 975 1533
## 1 1 1 1 1
As you can see, out of the 74333 forward reads a total of 56871 (76%) are unique and only occur once! This is an issue for amplicon inference using dada2, because we can only infer ASVs from duplicated reads.
What about the reverse read
Luckily most of the errors are postioned at the end of the reads, and we can remove some of the poor quality regions at the end of the reads. This will reduce the number of unique sequences and will allow for more possible variants to be called.
The length of the V3V4 amplicons are between 400 and 430 bases while intotal we sequenced 2x251 bases. This means there is a 70-100 basepair overlap between the forward and reverse read. In order to merge the forward and reverse reads downstream, we need a minimal 20 bp overlap. Thus the sum of the read length must be at least 450 bp (forward + reverse = 450).
What would be appropriate read trim length? Hint: check the cummalative expected error rate plot above
Answer?
We can trim and filter our reads using the dada2 filterAndTrim function.
In contrast to the dada2 tutorial we do not filter on expected errors. We have found that sequence quality differs between amplicons and therefore quality filtering will skew the final composition.
length_trimmed_forward=240 # not optimal, check the error profiles for optimal trimming
length_trimmed_reverse=240 # not optimal, check the error profiles for optimal trimming
# filtered output files
FQFF <- "FQFF.gz"
FQRF <- "FQRF.gz"
out <- filterAndTrim(FQF, FQFF, FQR, FQRF,
truncLen=c(length_trimmed_forward,length_trimmed_reverse),
maxN=0, truncQ=0, rm.phix=F, compress=TRUE)
out
## reads.in reads.out
## L3_MOCK1.F.fastq.gz 74333 74333
NOTE: Individual statistics will change when different filtering criteria are used! Differences will indicate if you made a better choise for trimming
Check the duplication rate after trimming.
drf <- derepFastq(FQFF)
drf
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 55655 unique sequences
## Sequence lengths: min=240, median=240, max=240
## $quals: Quality matrix dimension: 55655 240
## Consensus quality scores: min=7, median=35, max=38
## $map: Map from reads to unique sequences: 50532 5878 9482 24071 16532 ...
sum(table(drf$map)==1)
## [1] 53128
drr <- derepFastq(FQRF)
drr
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 65323 unique sequences
## Sequence lengths: min=240, median=240, max=240
## $quals: Quality matrix dimension: 65323 240
## Consensus quality scores: min=7, median=32, max=38
## $map: Map from reads to unique sequences: 36597 234 65199 23704 29362 ...
sum(table(drr$map)==1)
## [1] 63749
What happened tot the expected error rate and read duplication?
Answer?
The second step in the process is the estimation of the batch specific error rates. Use of the quality score to test the validity of a sequence is what sets dada2 apart from the other popular ASV inference methods like Deblur and UNOISE3.
Error estimation is somewhat computational demanding and therefore pre-calculated error rates for this particular sequence run are supplied. (Otherwise error rates can be obtained using ?learnErrors). The following plots can be used to asses the error models. They should show linear decrease in substitution rates over increasing quality scores.
errF <- readRDS("./Raw_data/L3_dada2.errF.RDS")
plotErrors(errF, nominalQ=TRUE)
## Warning: Transformation introduced infinite values in continuous y-axis
errR <- readRDS("./Raw_data/L3_dada2.errR.RDS")
plotErrors(errR, nominalQ=TRUE)
## Warning: Transformation introduced infinite values in continuous y-axis
Points show the quality score dependent substitution rates while the red line shows the expected rates.
What do you observe regarding the expected vs the real error frequency?
Answer?
The third step in the dada2 workflow is the actual inference of the ASVs using the dereplicated samples and the corresponding error models. For this we call the dada function.
asv.f <- dada(derep = drf, err = errF)
## Sample 1 - 74333 reads in 55655 unique sequences.
asv.f
## dada-class: object describing DADA2 denoising results
## 217 sequence variants were inferred from 55655 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
asv.r <- dada(derep = drr, err = errR)
## Sample 1 - 74333 reads in 65323 unique sequences.
asv.r
## dada-class: object describing DADA2 denoising results
## 70 sequence variants were inferred from 65323 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
table(is.na(asv.f$map[drf$map]),is.na(asv.r$map[drr$map]))
##
## FALSE TRUE
## FALSE 65507 4889
## TRUE 2766 1171
sum(is.na(asv.f$map[drf$map]) | is.na(asv.r$map[drr$map]))
## [1] 8826
If you didnt change the trimming parameters DADA2 will have inferred 217 amplicon sequences from the forward reads and 70 from the reverse reads. Furthermore DADA2 did not find an suitable ASV representative for 18544 (25%) of the reads.
NOTE: asv.f\(map shows which dereplicated sequences map to a particular ASV, and asv.f\)map[drf$map] would give the mapping for each of the oringal sequences in the fastq (not dereplicated)
How do the forward and reverse read differ in performance?Answer?
How could trimming differently improve these statistics?
Answer?
Old school OTU clustering methods rely on fixed sequence distances.
Lets have a look at the read to ASV mapping. I will select an arbitrary forward ASV and trace which sequences are mapping to this particular ASV. Both the derep and dada-class objects have a mapping file. Lets first retrieve the reads from the fastq file.
x=24 # arbitrairy ASV ID
asv.f$denoised[x] # ASV Sequence
## TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGATGGGCAAGTCTGATGTGAAAACCCGGGGCTCAA
## 485
F_ASV_4.FQ <- Biostrings::readQualityScaledDNAStringSet(FQFF)[drf$map %in% which(asv.f$map %in% x)] # reads mapping to forward ASV x
In total 485 reads map to this ASV.
Let check the divergence of the reads with the actual ASV. When we align the reads with the ASV we can calculate the hamming distance. The hamming distance should represent the true sequencing error for each read.
read_asv_ham_dist <- nwhamming(as.character(F_ASV_4.FQ), asv.f$sequence[x])
summary(read_asv_ham_dist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 5.064 7.000 34.000
barplot(table(read_asv_ham_dist)); abline(v = 240*0.03, col="red") #240*0.03 = 7.2 ~ 97% similarity @ 240 basepairs
sum(read_asv_ham_dist>7.2)
## [1] 119
As we can see most of the reads have at least one error and roughly 119 (20%) of the reads have more than 3% error rate.
If we take this particular subset of sequences and would run a OTU clustering based approach, what would the results look like?
Answer?
Lets assume this was the sole bacterium in our sample and the selected reads was our limited dataset. What would an OTU clustering approach tell us?
First, we would get a distance matrix for all sequences. Here we will align all vs all reads and calculate the hamming distance (nwhamming). The code is very inefficient and just for show casing the sequence variance within the ASV. However with ever increasing datasets sizes, this was becoming a limiting factor clustering based approaches.
seq_abun <- data.frame(abundance=table(as.character(F_ASV_4.FQ)))
seq_abun$hamming_distance <- nwhamming(as.character(seq_abun$abundance.Var1), asv.f$sequence[x])
asv.dist.mat <- lapply(as.character(seq_abun$abundance.Var1), function(x) nwhamming(as.character(seq_abun$abundance.Var1), x))
We can visualize a clustering based approach by generating a phylogentic tree for these particular reads. Cluster the sequences in a hierarchical tree and color the tips according to the OTU cluster.
simmat <- do.call(rbind, asv.dist.mat)/430
tree <- hclust(as.dist(simmat))
OTU <- cutree(tree = tree, h = 0.03) # ~97% similarity
aggregate(seq_abun$abundance.Freq, by=list(OTU), FUN=sum) # count the number or reads per "OTU"
## Group.1 x
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
## 7 7 285
## 8 8 1
## 9 9 1
## 10 10 1
## 11 11 8
## 12 12 3
## 13 13 1
## 14 14 1
## 15 15 4
## 16 16 17
## 17 17 3
## 18 18 2
## 19 19 9
## 20 20 2
## 21 21 5
## 22 22 1
## 23 23 5
## 24 24 1
## 25 25 1
## 26 26 1
## 27 27 1
## 28 28 1
## 29 29 4
## 30 30 15
## 31 31 1
## 32 32 3
## 33 33 1
## 34 34 1
## 35 35 1
## 36 36 1
## 37 37 1
## 38 38 1
## 39 39 7
## 40 40 1
## 41 41 1
## 42 42 1
## 43 43 1
## 44 44 1
## 45 45 10
## 46 46 1
## 47 47 1
## 48 48 3
## 49 49 1
## 50 50 5
## 51 51 3
## 52 52 11
## 53 53 1
## 54 54 1
## 55 55 1
## 56 56 2
## 57 57 1
## 58 58 1
## 59 59 1
## 60 60 1
## 61 61 2
## 62 62 2
## 63 63 1
## 64 64 1
## 65 65 1
## 66 66 1
## 67 67 2
## 68 68 1
## 69 69 1
## 70 70 1
## 71 71 1
## 72 72 1
## 73 73 1
## 74 74 2
## 75 75 1
## 76 76 2
## 77 77 1
## 78 78 1
## 79 79 1
## 80 80 1
## 81 81 2
## 82 82 1
## 83 83 1
## 84 84 1
## 85 85 1
## 86 86 1
## 87 87 1
## 88 88 1
## 89 89 1
## 90 90 1
## 91 91 1
## 92 92 1
## 93 93 1
## 94 94 1
sort(aggregate(seq_abun$abundance.Freq, by=list(OTU), FUN=sum)[,2]) # sort by OTU abundance
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [20] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [58] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
## [77] 3 3 3 3 3 4 4 5 5 5 7 8 9 10 11 15 17 285
plot.phylo(as.phylo(tree), direction = "downwards", show.tip.label = T, tip.color = rainbow(length(unique(OTU)))[OTU])
As you can see 94 unique OTUs would be formed if we would cluster the reads at 97% identity.
What would this scattering of the sequences have for implactions in downstream analysis of the data For example in terms of alpha diversity
Answer?
The error model already showed that the acual sequence error rate was probably higher than we could anticipate from the quality profiles. We can now also explore the expected error rate versus the true error rate based on the reads selected.
plot(apply(10^(-as.matrix(PhredQuality(F_ASV_4.FQ@quality))/10),1,sum), read_asv_ham_dist,
xlab="Expected Error", ylab="True Errors")
What do you observe in terms of expected error versus true error rate?
Answer?
To get a single representative ASV sequence, forward and reverse ASV pairs are merged. Only those pairs that can be joined with at least a 20 bases overlap and without any mismatches are accepted. For this we use the mergePairs function.
asv.m <- mergePairs(dadaF = asv.f, derepF = drf, dadaR = asv.r, derepR = drr, returnRejects = T, maxMismatch = 0)
## Duplicate sequences in merged output.
DT::datatable(asv.m)
This table shows the statistics for all the asv pairs. Forward and reverse read mismatching is a common problem of Illumina sequencing. Therefore significant amount of reads are lost in this process.
How many ASVs do we have and how many reads do they represent?
length(asv.m$accept) # number of possible ASV pair combinations
## [1] 3592
sum(asv.m$accept) # number of ASV accepted
## [1] 372
sum(asv.m[asv.m$accept,]$abundance) # number of reads in accepted ASV
## [1] 52071
sum(asv.m[!asv.m$accept,]$abundance) # number of reads in not accepted ASVs
## [1] 13436
Out of 3592 possible combinations 372 valid merged ASVs are generated. A total of 13436 (15%) reads belong to read pairs which do not properly merge. In total 30 % of the reads have been lost during processing of the data.
During PCR of the library prep and bridge amplification before sequencing, potential chimeric sequences are generated. These hybrid sequences generate artificial link in the phylogenetic relation of these ASVs. Therefore the ASVs are screened for chimeras and are usually removed from the data. V3V4 amplicons have a conserved region in the middle and are therefore especially prone to chimera formation.
Fig 2. Chimeric sequences
DADA2 has a function isBimeraDenovo to detect these chimeric sequences.
seqtab <- makeSequenceTable(asv.m[asv.m$accept,])
bimeras <- isBimeraDenovo(seqtab)
asv.m$bimera <- bimeras[asv.m$sequence]
sum(bimeras)
## [1] 289
sum(na.omit(asv.m$abundance[asv.m$bimera]))
## [1] 8055
A total of 289 ASVs were identified as possible chimeric sequences, representing 8055 reads (15%). When we remove these we will have completed the ASV inference and we will get our final ASV abundances.
seqtab.nochim <- seqtab[,names(which(!bimeras))]
length(seqtab.nochim)
## [1] 83
sum(seqtab.nochim)
## [1] 44016
In the end a total of 83 valid ASVs were inferred representing 44016 reads (57%). This means we lost a total 43% of the reads while processing the data. These are quite common statistics.
How could the algoritm statistics be improved? How could the loss of reads impact interpretation of the data? Note: each individual read was considered an observation
Answer?
To determine the accuracy and precision of our obtained profile, we compare the ASVs to to the sequences of the 55 reference strains. The reference sequences for each member have been obtained by sanger sequencing. Sanger sequencing does not allow to discriminate between isoforms and therefore the reference sequences contain some ambiguous basecalls. Thus we can expect some discrepancies between the ASV and the reference sequences which are due to errors in the reference rather than to errors in the ASVs.
The reference sequences contain the entire 16S gene. In order to compare it to the data we can perform an in silico pcr to determine the expected ASV.
asv.valid.seqs <- names(seqtab.nochim)
# read in the sequences of the mock
mockref <- as.character(readDNAStringSet("./Raw_data/Mock_reference_NoN.fasta"))
insilicopcr <- list()
for(ref in names(mockref)){
refseq <- mockref[ref]
start <- vmatchPattern(pattern = "CCTACGGGAGGCAGCAG", subject = refseq, #forward V3 primer seq
max.mismatch=1, min.mismatch=0,
with.indels=FALSE, fixed=TRUE,
algorithm="auto")
end <- vmatchPattern(pattern = "GGATTAGATACCCTTGTAGTC", subject = refseq, # reverse V4 primer seq
max.mismatch=5, min.mismatch=0,
with.indels=FALSE, fixed=TRUE,
algorithm="auto")
start.i <- endIndex(start)[[1]]+1
end.i <- startIndex(end)[[1]]-1
if (!(isEmpty(start.i) | isEmpty(end.i))){insilicopcr[[ref]] <- Biostrings::subseq(refseq, start = start.i, end = end.i)}
}
mock.comp <- data.frame(read.csv("./Raw_data/Mock.composition.txt", sep=",", header = T, row.names = 1)) # get expected abundances
mock.comp$ASV <- unname(unlist(insilicopcr)) #include expected ASV sequence
Lets check the exact overlap between the expected amplicon sequences and the obtained ASV sequences
mock.comp$ASV %in% names(seqtab.nochim)
## [1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [37] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## [49] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
#DT::datatable(mock.comp)
As we can see many of the MOCK in silico amplicons have identical sequences to the obtained ASVs. Lets see how well they actually match by calculating the pairwise hamming distance between the sample and mock members.
hamming.list <- lapply(asv.valid.seqs, function(x) nwhamming(x, mock.comp$ASV))
hamming.distance.mat <- do.call(rbind, hamming.list)
colnames(hamming.distance.mat) <- rownames(mock.comp)
rownames(hamming.distance.mat) <- asv.valid.seqs
#hamming.distance.mat.long <- setNames(melt(hamming.distance.mat), c('asv', 'mock', 'dist'))
Get the closest relative for each mock and asv
mockrefdist <- data.frame(dist = colMins(hamming.distance.mat),
name = names(mockref))
ggplot(mockrefdist, aes(x=name, y=dist)) +
geom_bar(stat="identity") +
coord_flip() +
labs(y="Minimum hamming distance", title="Distance mock to representive asv (Recall)", x=NULL)
54 of the mock members have a perfect or good representative ASV in the sample. Only a single species, MC_2 Micrococcus MC_2 does not have a good matching ASV. Recall is calculated as TP/(TP+FN), which in this case is 54/55=98%. So we can say that we have a recall of 98% in detection.
Besides recall another import measure is precision. In this case how many of the ASVs are actually derived from the mock. Lets assume that all ASVS with a higher than five hamming distance to the mock members are false positives.
Note, our reference sequences are imperfect so we need allow for some mismatching.
asvrefdist <- data.frame(dist = rowMin(hamming.distance.mat),
name = paste0("ASV_",stringr::str_pad(as.character(1:length(asv.valid.seqs)), width=3, pad = "0")))
ggplot(asvrefdist, aes(x=name, y=dist)) +
geom_bar(stat="identity") +
coord_flip() +
ggplot2::geom_hline(yintercept = 5, col='red') +
labs(y="Minimum hamming distance", title="Distance asv to mock representive (Precision)", x=NULL)
# False positive ASVs
sum(asvrefdist$dist<=5)
## [1] 66
# False positive ASVs total reads
#sum(seqtab.nochim[asvrefdist$dist<=5])
As we can see there were quiet some ASVs which are not likely to be derived from the mock members. In binary terms, ie presence or absence, we can say we have a precision of 66/80 which is 83%. Most of these are actually stealthy chimeras.
What would the precision be if we consider the reads rather than the ASVs? Hint sum(seqtab.nochim[asvrefdist$dist<=5])
Since ASV sequences by themselves are none informative we need to put them into context. We use phylogenetic trees to put the sequences in evolutionary context, while we use taxonomy in order to put them in scientific context.
Taxonomy is how we classify organisms into specific groups at certain ranks. Taxonomy tries to capture phylogeny (evolutionary relationships) at a coarse level.
Fig 3. Main taxonomic ranks
Lets assign taxonomy to the inferred ASV sequences. DADA2 implements a variant of the (RDP) naive bayes classifier. In short kmer profiles from the ASV sequences are compared to those in a curated taxonomy reference database. In this case the reference database is Silva V132.
tax <- assignTaxonomy(seqs = asv.valid.seqs,"./Raw_data/silva_nr_v132_train_set.fa.gz")
# lets inlcude the Genus taxonomy in the name of the ASV
asvrefdist$name_genus <- paste(asvrefdist$name,tax[asv.valid.seqs,"Genus"])
A phylogenetic tree tries to capture the evolutionary relationships of various organisms. We can use the sequence (dis)similarity of the ASVs to infer these phylogenetic relationships. These trees help in generating context for our inferred sequences.
seqs <- c(asv.valid.seqs,mock.comp$ASV)
names(seqs) <- c(asvrefdist$name_genus,mockrefdist$name)
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)
## Determining distance matrix based on shared 8-mers:
## ================================================================================
##
## Time difference of 0.17 secs
##
## Clustering into groups by similarity:
## ================================================================================
##
## Time difference of 0.03 secs
##
## Aligning Sequences:
## ================================================================================
##
## Time difference of 1.54 secs
##
## Iteration 1 of 2:
##
## Determining distance matrix based on alignment:
## ================================================================================
##
## Time difference of 0.01 secs
##
## Reclustering into groups by similarity:
## ================================================================================
##
## Time difference of 0.02 secs
##
## Realigning Sequences:
## ================================================================================
##
## Time difference of 0.64 secs
##
## Iteration 2 of 2:
##
## Determining distance matrix based on alignment:
## ================================================================================
##
## Time difference of 0.01 secs
##
## Reclustering into groups by similarity:
## ================================================================================
##
## Time difference of 0.03 secs
##
## Realigning Sequences:
## ================================================================================
##
## Time difference of 0.17 secs
phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm)
fit = pml(treeNJ, data=phang.align)
## negative edges length changed to 0!
fitGTR <- update(fit, k=4, inv=0.2)
#fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
# rearrangement = "stochastic", control = pml.control(trace = 0))
fitGTR$tree <- midpoint(fitGTR$tree)
Lets visualize the tree; sequences in out mock are colored red, ASVs which correspond to a true member of the mock are colored green, while ASVs which are unlikely derived from the mock community (Xeno) are colored blue.
groupInfo <- rep("Silva",length(fitGTR$tree$tip.label))
groupInfo[fitGTR$tree$tip.label %in% asvrefdist[asvrefdist$dist>=5,]$name_genus] <- "Xeno_ASV"
groupInfo[fitGTR$tree$tip.label %in% asvrefdist[asvrefdist$dist<5,]$name_genus] <- "Valid_ASV"
groupInfo[substr(fitGTR$tree$tip.label,1,3)=="MC_"] <- "MOCK"
groupInfo <- split(fitGTR$tree$tip.label, groupInfo)
fitGTR$tree <- groupOTU(fitGTR$tree, groupInfo)
test <- fitGTR$tree
attr(test, "Source") <- attributes(fitGTR$tree)$group
p <- ggtree(test, layout = "rectangular")+ geom_tiplab(size=3, aes(color=Source),key_glyph = "rect", ) + ggplot2::xlim(0, 0.3)
## Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
## Also defined by 'tidytree'
p
We can see that most of the ASVs and mock sequences occur in pairs and their associated taxonomies are mostly congruent. The sequences that do not get classified at Genus level (NA) are all labeled xeno, and are likely stealthy chimeras. However there apear some valid sequences that are not originating from the mock.
What could these sequences represent?
Answer?
Various steps in generating and processing of the data results in bias in the observed final composition (abundance). Since taxonomy is too coarse grained and potentially biased, lets cluster the ASVs based on the distance in the tree.
dd = as.dist(cophenetic.phylo(fitGTR$tree))
psclust = cutree(hclust(dd), h = 0.05)
I’ve group the tree tips (ASVs, and insilico PCR amplicons) at a semi arbitrary distance of 0.05. How would the results change if I would use a much higher or lower cutoff?
Answer?
Lets aggregate the ASV abundance and mock composition, based on the clustering of the tree. Once we’ve grouped the ASV and mock sequences we can compare the expected versus the observed relative abundances.
seqtab.nochim.relabu <- seqtab.nochim/sum(seqtab.nochim)
Abundances <- c(seqtab.nochim.relabu, mock.comp$MC3/100)
names(Abundances) <- names(seqs)
cluster_list <- split(names(psclust), psclust)
names(cluster_list) <- unlist(lapply(cluster_list, function(x) names(which.max(Abundances[x])))) # representative sequence
cl.abundance_sample <- aggregate(seqtab.nochim.relabu, by=list(psclust[c(asvrefdist$name_genus)]), FUN=sum)
cl.abundance_mock <- aggregate(mock.comp$MC3/100, by=list(psclust[c(mockrefdist$name)]), FUN=sum)
rownames(cl.abundance_sample) <- names(cluster_list)[cl.abundance_sample[,1]]
rownames(cl.abundance_mock) <- names(cluster_list)[cl.abundance_mock[,1]]
df <- data.frame(label=names(cluster_list),
sample=cl.abundance_sample[names(cluster_list),2],
mock=cl.abundance_mock[names(cluster_list),2]
)
df[is.na(df)] <- 0
ggplot(df, aes(x=sample, y=mock)) +
geom_point(size=3) +
ggrepel::geom_text_repel(aes(label=label)) +
geom_abline(slope = 1) +
labs(x="Measured Abundance", y="Predicted Abundance") +
#scale_x_log10() +
#scale_y_log10() +
NULL
## Warning: ggrepel: 27 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
While we can observe some bias for specific groups like Akkermansia and Roseburia overall there is significant correlation betweem the observed and expected abundance.
How does the observed species specific bias impact the data? For example; We observe a lot more akkermansia reads than we measured. Note, we measured a fixed set of reads